Automated part-of-speech analysis of Urdu: conceptual and technical issues

Author

  • Andrew Hardie
Abstract

Part-of-speech (POS) tagging is the process of labelling tokens in a text with tags that indicate their morphosyntactic category. It has a wide range of applications in computational and corpus linguistics, such as the production of corpus-based dictionaries and grammars. This paper describes an experiment in extending POS tagging to a hitherto untagged language, Urdu. The most challenging task in POS tagging is disambiguation, i.e. resolving the contextual ambiguity of a token for which more than one tag is possible. Three important approaches to disambiguation have been developed: approaches based on rules devised by a linguist; probabilistic approaches that apply corpus-derived statistics in a mathematical model such as a Markov model; and the approach of Brill (1995), in which rules are learned automatically from a corpus. However, because only a small amount of pre-tagged data was available for Urdu, only the rule-based approach was appropriate for the Urdu tagger described here. A rule-based tagger for Urdu was created within the Unitag architecture, together with the requisite language-specific resources for Urdu (including a tagset, an analyser, a lexicon, and a rule list). An evaluation of the tagger suggests that it performs at a level of accuracy notably below that commonly reported for languages such as English. However, this poor performance is primarily attributable to the small size of the lexicon, which in turn reflects the small quantity of training data available. The disambiguation rules themselves were more successful.
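
As a rough illustration of what rule-based disambiguation involves, the Python sketch below assigns each token every tag its lexicon entry allows and then applies a hand-written context rule to discard readings. Everything in it is an assumption made for illustration: the English toy lexicon, the tag names, and the single rule are hypothetical, and none of it reproduces the Unitag architecture, the Urdu tagset, or the rule list described in the paper.

```python
# Toy illustration only: the lexicon, tag names, and rule are hypothetical
# and are not taken from the Unitag tagger or its Urdu resources.

LEXICON = {
    "the": {"DET"},
    "can": {"NOUN", "VERB", "AUX"},
    "rusts": {"VERB", "NOUN"},
}


def analyse(tokens):
    """Give every token its full set of candidate tags from the lexicon.

    Tokens missing from the lexicon receive the placeholder tag UNKNOWN,
    which is how a small lexicon directly limits tagging accuracy.
    """
    return [set(LEXICON.get(tok, {"UNKNOWN"})) for tok in tokens]


def rule_no_verb_after_det(candidates, i):
    """Context rule: right after an unambiguous determiner, drop verbal tags."""
    if i > 0 and candidates[i - 1] == {"DET"}:
        remaining = candidates[i] - {"VERB", "AUX"}
        if remaining:  # never remove a token's last remaining candidate
            candidates[i] = remaining


def disambiguate(tokens, rules):
    """Apply each rule at each position, left to right, in a single pass."""
    candidates = analyse(tokens)
    for i in range(len(tokens)):
        for rule in rules:
            rule(candidates, i)
    return list(zip(tokens, candidates))


result = disambiguate(["the", "can", "rusts"], [rule_no_verb_after_det])
for token, tags in result:
    print(token, sorted(tags))
```

In this sketch the rule resolves "can" to a single tag, while "rusts" keeps two candidates because no rule covers its context. A real rule list would need many more rules, and any token absent from the lexicon falls back to UNKNOWN regardless of how good the rules are.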


Similar resources

Developing a tagset for automated part-of-speech tagging in Urdu

While part-of-speech tagging is an established technology for Western European languages such as English or Spanish, extending the technique to Urdu presents a range of interesting issues. There are some problems associated with the writing system, e.g. the problems of locating token boundaries in the Urdu version of the Arabic script. However, there are also linguistic issues. Litt...


Developing a tagset for automated part-of-speech tagging in Urdu Andrew

While part-of-speech tagging is an established technology for Western European languages such as English or Spanish, extending the technique to Urdu presents a range of interesting issues. There are some problems associated with the writing system, e.g. the problems of locating token boundaries in the Urdu version of the Arabic script. However, there are also linguistic issues. Little work has ...


Politeness Orientation in Social Hierarchies in Urdu

The present research is aimed at investigating how the politeness of the speakers of Urdu is influenced by their relative social status in society. The researcher took politeness theory of Brown and Levinson (1978, 1987) as a model. To observe politeness of Urdu speakers, speech act of apology with different strategies was selected. A Discourse Completion Task (DCT) was used as an instrument to...


Testing Problems in Russian as a Foreign Language in a Technical University

Problems in the theory and practice of testing Russian as a foreign language for entrants to technical universities are considered. The benefits of test formats for assessing foreign students’ Russian-language skills under a strict time limit are presented. The structure and content of the tests, all types of tasks offered on the entrance and final examinations in the Russian languag...


Semi-Semantic Part of Speech Annotation and Evaluation

This paper presents the semi-semantic part of speech annotation and its evaluation via Krippendorff’s α for the URDU.KON-TB treebank developed for the South Asian language Urdu. The part of speech annotation with the additional subcategories of morphology and semantics provides a treebank with sufficient encoded information. The corpus used is collected from the Urdu Wikipedia and newspapers. ...


Publication date: 2013